--- name: imaging-data-commons description: Use idc-index to query and download public cancer imaging data from NCI Imaging Data Commons. Used to access large-scale radiology (CT, MR, PET) and pathology datasets for AI training or research. No authentication required. Supports metadata querying, in-browser visualization, and license checking. license: MIT author: aipoch --- > **Source**: [https://github.com/aipoch/medical-research-skills](https://github.com/aipoch/medical-research-skills) # Imaging Data Commons ## When to Use - Use this skill when you need use idc-index to query and download public cancer imaging data from nci imaging data commons. used to access large-scale radiology (ct, mr, pet) and pathology datasets for ai training or research. no authentication required. supports metadata querying, in-browser visualization, and license checking in a reproducible workflow. - Use this skill when a data analytics task needs a packaged method instead of ad-hoc freeform output. - Use this skill when the user expects a concrete deliverable, validation step, or file-based result. - Use this skill when `the documented workflow in this package` is the most direct path to complete the request. - Use this skill when you need the `imaging-data-commons` package behavior rather than a generic answer. ## Key Features - Scope-focused workflow aligned to: Use idc-index to query and download public cancer imaging data from NCI Imaging Data Commons. Used to access large-scale radiology (CT, MR, PET) and pathology datasets for AI training or research. No authentication required. Supports metadata querying, in-browser visualization, and license checking. - Documentation-first workflow with no packaged script requirement. - Reference material available in `references/` for task-specific guidance. - Structured execution path designed to keep outputs consistent and reviewable. ## Dependencies - `Python`: `3.10+`. Repository baseline for current packaged skills. - `Third-party packages`: `not explicitly version-pinned in this skill package`. Add pinned versions if this skill needs stricter environment control. ## Example Usage ```text Skill directory: 20260316/scientific-skills/Data Analytics/imaging-data-commons No packaged executable script was detected. Use the documented workflow in SKILL.md together with the references/assets in this folder. ``` Example run plan: 1. Read the skill instructions and collect the required inputs. 2. Follow the documented workflow exactly. 3. Use packaged references/assets from this folder when the task needs templates or rules. 4. Return a structured result tied to the requested deliverable. ## Implementation Details See `## Overview` above for related details. - Execution model: validate the request, choose the packaged workflow, and produce a bounded deliverable. - Input controls: confirm the source files, scope limits, output format, and acceptance criteria before running any script. - Primary implementation surface: instruction-only workflow in `SKILL.md`. - Reference guidance: `references/` contains supporting rules, prompts, or checklists. - Parameters to clarify first: input path, output path, scope filters, thresholds, and any domain-specific constraints. - Output discipline: keep results reproducible, identify assumptions explicitly, and avoid undocumented side effects. ## Overview Use the `idc-index` Python package to query and download public cancer imaging data from the National Cancer Institute (NCI) Imaging Data Commons (IDC). No authentication required to access data. **Core Tool:** `idc-index` ([GitHub](https://github.com/imagingdatacommons/idc-index)) **View latest data scale:** ```python from idc_index import IDCClient client = IDCClient() # Get IDC data version print(client.get_idc_version()) # Get collection count and total series count stats = client.sql_query(""" SELECT COUNT(DISTINCT collection_id) as collections, COUNT(DISTINCT analysis_result_id) as analysis_results, COUNT(DISTINCT PatientID) as patients, COUNT(DISTINCT StudyInstanceUID) as studies, COUNT(DISTINCT SeriesInstanceUID) as series, SUM(instanceCount) as instances, SUM(series_size_MB)/1000000 as size_TB FROM index """) print(stats) ``` **Core Workflow:** 1. Query metadata → `client.sql_query()` 2. Download DICOM files → `client.download_from_selection()` 3. Visualize in browser → `client.get_viewer_URL(seriesInstanceUID=...)` ## When to Use This Skill - Finding publicly available radiology (CT, MR, PET) or pathology (slide microscopy) images - Filtering image subsets by cancer type, imaging modality, anatomical site, or other metadata - Downloading DICOM data from IDC - Checking data licenses before research or commercial use - Visualizing medical images in the browser without needing local DICOM viewer software ## IDC Data Model IDC adds two grouping levels above the standard DICOM hierarchy (Patient → Study → Series → Instance): - **collection_id**: Groups patients by disease, modality, or research focus (e.g., `tcga_luad`, `nlst`). A patient belongs to only one collection. - **analysis_result_id**: Identifies derived objects (segmentations, annotations, radiomics features) across one or more original collections. Use `collection_id` to find original imaging data (may include annotations stored with images); use `analysis_result_id` to find AI-generated or expert-annotated annotations. **Key Identifiers for Querying:** | Identifier | Scope | Purpose | |------------|-------|---------| | `collection_id` | Dataset grouping | Filter by project/study | | `PatientID` | Patient | Group images by patient | | `StudyInstanceUID` | DICOM study | Group related series, visualization | | `SeriesInstanceUID` | DICOM series | Group related series, visualization | ## Index Tables The `idc-index` package provides multiple metadata index tables accessible via SQL or pandas DataFrame. **Important:** Use `client.indices_overview` to get current table descriptions and column schemas. This is the authoritative source for available columns and their types—always consult this when writing SQL or exploring data structures. ### Available Tables | Table Name | Row Granularity | Loading Method | Description | |-------|-----------------|--------|-------------| | `index` | 1 row = 1 DICOM series | Auto | Core metadata for all current IDC data | | `prior_versions_index` | 1 row = 1 DICOM series | Auto | Series from previous IDC versions; for downloading deprecated data | | `collections_index` | 1 row = 1 collection | fetch_index() | Collection-level metadata and descriptions | | `analysis_results_index` | 1 row = 1 analysis result set | fetch_index() | Metadata about derived datasets (annotations, segmentations) | | `clinical_index` | 1 row = 1 clinical data column | fetch_index() | Mapping of clinical table columns to collections | | `sm_index` | 1 row = 1 slide microscopy series | fetch_index() | Slide microscopy (pathology) series metadata | | `sm_instance_index` | 1 row = 1 slide microscopy instance | fetch_index() | Instance-level metadata for slide microscopy (SOPInstanceUID) | | `seg_index` | 1 row = 1 DICOM segmentation series | fetch_index() | Segmentation metadata: algorithm, number of segments, reference to source image series | **Auto** = Automatically loaded when `IDCClient()` is instantiated **fetch_index()** = Requires executing `client.fetch_index("table_name")` to load ### Joining Tables **Key columns are not explicitly marked; here is a subset that can be used for joins.** | Join Column | Tables | Use Case | |-------------|--------|----------| | `collection_id` | index, prior_versions_index, collections_index, clinical_index | Link series to collection metadata or clinical data | | `SeriesInstanceUID` | index, prior_versions_index, sm_index, sm_instance_index | Cross-table linking of series; connect to slide microscopy details | | `StudyInstanceUID` | index, prior_versions_index | Link studies across current and historical data | | `PatientID` | index, prior_versions_index | Link patients across current and historical data | | `analysis_result_id` | index, analysis_results_index | Link series to analysis result metadata (annotations, segmentations) | | `source_DOI` | index, analysis_results_index | Link via publication DOI | | `crdc_series_uuid` | index, prior_versions_index | Link via CRDC unique identifier | | `Modality` | index, prior_versions_index | Filter by imaging modality | | `SeriesInstanceUID` | index, seg_index | Link segmentation series to its index metadata | | `segmented_SeriesInstanceUID` | seg_index → index | Link segmentation to its source image series (join seg_index.segmented_SeriesInstanceUID = index.SeriesInstanceUID) | **Note:** `Subjects`, `Updated`, and `Description` appear in multiple tables but with different meanings (counts vs. identifiers, different update contexts). **Join Examples:** ```python from idc_index import IDCClient client = IDCClient() # Join index with collections_index to get cancer types client.fetch_index("collections_index") result = client.sql_query(""" SELECT i.SeriesInstanceUID, i.Modality, c.CancerTypes, c.TumorLocations FROM index i JOIN collections_index c ON i.collection_id = c.collection_id WHERE i.Modality = 'MR' LIMIT 10 """) # Join index with sm_index to get slide microscopy details client.fetch_index("sm_index") result = client.sql_query(""" SELECT i.collection_id, i.PatientID, s.ObjectiveLensPower, s.min_PixelSpacing_2sf FROM index i JOIN sm_index s ON i.SeriesInstanceUID = s.SeriesInstanceUID LIMIT 10 """) # Join seg_index with index to find segmentations and their source images client.fetch_index("seg_index") result = client.sql_query(""" SELECT s.SeriesInstanceUID as seg_series, s.AlgorithmName, s.total_segments, src.collection_id, src.Modality as source_modality, src.BodyPartExamined FROM seg_index s JOIN index src ON s.segmented_SeriesInstanceUID = src.SeriesInstanceUID WHERE s.AlgorithmType = 'AUTOMATIC' LIMIT 10 """) ``` ### Accessing Index Tables **Via SQL (recommended for filtering/aggregation):** ```python from idc_index import IDCClient client = IDCClient() # Query main index (always available) results = client.sql_query("SELECT * FROM index WHERE Modality = 'CT' LIMIT 10") # Fetch and query additional indexes client.fetch_index("collections_index") collections = client.sql_query("SELECT collection_id, CancerTypes, TumorLocations FROM collections_index") client.fetch_index("analysis_results_index") analysis = client.sql_query("SELECT * FROM analysis_results_index LIMIT 5") ``` **As pandas DataFrame (direct access):** ```python # Main index (always available after client initialization) df = client.index # Fetch and access on-demand indexes client.fetch_index("sm_index") sm_df = client.sm_index ``` ### Discovering Table Schemas (Key to Writing Queries) The `indices_overview` dictionary contains complete schema information for all tables. **Always refer to this information when writing queries or exploring data structures.** **DICOM Attribute Mapping:** Many columns are populated directly from DICOM attributes in source files. Column descriptions in the schema indicate whether a column corresponds to a DICOM attribute (e.g., "DICOM Modality attribute" or reference to a DICOM tag). This allows leveraging DICOM knowledge at query time—standard DICOM attribute names like `PatientID`, `StudyInstanceUID`, `Modality`, `BodyPartExamined` all work as expected. ```python from idc_index import IDCClient client = IDCClient() # List all available indexes with their descriptions for name, info in client.indices_overview.items(): print(f"\n{name}:") print(f" Installed: {info['installed']}") print(f" Description: {info['description']}") # Get full schema for a specific index (columns, types, descriptions) schema = client.indices_overview["index"]["schema"] print(f"\nTable: {schema['table_description']}") print("\nColumns:") for col in schema['columns']: desc = col.get('description', 'No description') # Description indicates whether column comes from DICOM attribute print(f" {col['name']} ({col['type']}): {desc}") # Find columns that belong to DICOM attributes (check if description contains "DICOM") dicom_cols = [c['name'] for c in schema['columns'] if 'DICOM' in c.get('description', '').upper()] print(f"\nDICOM-sourced columns: {dicom_cols}") ``` **Alternative: Use `get_index_schema()` method:** ```python schema = client.get_index_schema("index") # Returns the same schema dictionary: {'table_description': ..., 'columns': [...]} ``` ### Key Columns in Primary `index` Table Most commonly used columns in queries (full list and descriptions available in `indices_overview`): | Column Name | Type | DICOM | Description | |--------|------|-------|-------------| | `collection_id` | STRING | No | IDC collection identifier | | `analysis_result_id` | STRING | No | If applicable, indicates the analysis result set this series belongs to | | `source_DOI` | STRING | No | DOI linking to dataset details; for further content info and attribution (see citation section below) | | `PatientID` | STRING | Yes | Patient identifier | | `StudyInstanceUID` | STRING | Yes | DICOM study UID | | `SeriesInstanceUID` | STRING | Yes | DICOM series UID — for download/viewing | | `Modality` | STRING | Yes | Imaging modality (CT, MR, PT, SM, etc.) | | `BodyPartExamined` | STRING | Yes | Anatomical site | | `SeriesDescription` | STRING | Yes | Series description | | `Manufacturer` | STRING | Yes | Device manufacturer | | `StudyDate` | STRING | Yes | Study execution date | | `PatientSex` | STRING | Yes | Patient sex | | `PatientAge` | STRING | Yes | Patient age at time of study | | `license_short_name` | STRING | No | License type (CC BY 4.0, CC BY-NC 4.0, etc.) | | `series_size_MB` | FLOAT | No | Series size in MB | | `instanceCount` | INTEGER | No | Number of DICOM instances in the series | **DICOM = Yes**: Column values extracted from DICOM attributes of the same name. For numeric tag mappings, refer to the [DICOM Standard](https://dicom.nema.org/medical/dicom/current/output/chtml/part06/chapter_6.html). Use standard DICOM knowledge to anticipate values and formats. ### Clinical Data Access ```python # Fetch clinical index (also downloads clinical data tables) client.fetch_index("clinical_index") # Query clinical index to find available tables and their columns tables = client.sql_query("SELECT DISTINCT table_name, column_label FROM clinical_index") # Load specific clinical table as DataFrame clinical_df = client.get_clinical_table("table_name") ``` For detailed workflows including value mapping schemas and joining clinical data with imaging data, see `references/clinical_data_guide.md`. ## Data Access Options | Method | Authentication Required | Best For | |--------|---------------|----------| | `idc-index` | No | Key queries and downloads (recommended) | | IDC Portal | No | Interactive exploration, manual selection, browser-based download | | BigQuery | Yes (GCP account) | Complex queries, complete DICOM metadata | | DICOMweb proxy | No | Tool integration via DICOMweb API | | Cloud storage (S3/GCS) | No | Direct file access, batch downloads, custom pipelines | **Cloud Storage Organization** All DICOM files mirrored between AWS S3 and Google Cloud Storage are stored in public cloud storage buckets. Files are organized by CRDC UUID (not DICOM UID) to support versioning. | Bucket (AWS / GCS) | License | Content | |--------------------|---------|---------| | `idc-open-data` / `idc-open-data` | No commercial restrictions | >90% of IDC data | | `idc-open-data-two` / `idc-open-idc1` | No commercial restrictions | Collections that may contain head scans | | `idc-open-data-cr` / `idc-open-cr` | Commercial use restricted (CC BY-NC) | ~4% of data | File storage format is `/.dcm`. Freely accessible via AWS CLI, gsutil, or s5cmd with anonymous access (no bandwidth fees). Use the `series_aws_url` column in the index to get S3 URL; GCS uses the same path structure. For bucket details, access commands, UUID mapping, and versioning, see `references/cloud_storage_guide.md`. **DICOMweb Access** IDC data is available via DICOMweb interface (implemented on Google Cloud Healthcare API) for integrating PACS systems and DICOMweb-compatible tools. | Endpoint | Authentication | Use Case | |----------|------|----------| | Public proxy | No | Testing, moderate queries, daily quota | | Google Healthcare | Yes (GCP) | Production use, higher quotas | For endpoint URLs, code examples, supported operations, and implementation details, see `references/dicomweb_guide.md`. ## Installation and Setup **Required (basic access):** ```bash pip install --upgrade idc-index ``` **Important:** Each IDC data release triggers a new version of `idc-index`. Unless you need an older version for reproducibility, always use the `--upgrade` flag when installing. **Tested version:** idc-index 0.11.7 (IDC data version v23) **Optional (data analysis):** ```bash pip install pandas numpy pydicom ``` ## Core Capabilities ### 1. Data Discovery and Exploration Discover available imaging collections and data in IDC: ```python from idc_index import IDCClient client = IDCClient() # Get summary statistics from main index query = """ SELECT collection_id, COUNT(DISTINCT PatientID) as patients, COUNT(DISTINCT SeriesInstanceUID) as series, SUM(series_size_MB) as size_mb FROM index GROUP BY collection_id ORDER BY patients DESC """ collections_summary = client.sql_query(query) # For richer collection metadata, use collections_index client.fetch_index("collections_index") collections_info = client.sql_query(""" SELECT collection_id, CancerTypes, TumorLocations, Species, Subjects, SupportingData FROM collections_index """) # For analysis results (annotations, segmentations), use analysis_results_index client.fetch_index("analysis_results_index") analysis_info = client.sql_query(""" SELECT analysis_result_id, analysis_result_title, Subjects, Collections, Modalities FROM analysis_results_index """) ``` **`collections_index`** provides curated metadata for each collection: cancer types, tumor locations, species, subject counts, and supported data types, without needing aggregation from the main index. **`analysis_results_index`** lists derived datasets (AI segmentations, expert annotations, radiomics features) with their source collections and modalities. ### 2. Querying Metadata with SQL Use SQL to query the IDC mini-index to find specific datasets. **First, explore optional filter values:** ```python from idc_index import IDCClient client = IDCClient() # Check what Modality values exist modalities = client.sql_query(""" SELECT DISTINCT Modality, COUNT(*) as series_count FROM index GROUP BY Modality ORDER BY series_count DESC """) print(modalities) # Check what BodyPartExamined values exist for MR modality body_parts = client.sql_query(""" SELECT DISTINCT BodyPartExamined, COUNT(*) as series_count FROM index WHERE Modality = 'MR' AND BodyPartExamined IS NOT NULL GROUP BY BodyPartExamined ORDER BY series_count DESC LIMIT 20 """) print(body_parts) ``` **Then query using validated filter values:** ```python # Find breast MRI scans (using actual values from exploration above) results = client.sql_query(""" SELECT collection_id, PatientID, SeriesInstanceUID, Modality, SeriesDescription, license_short_name FROM index WHERE Modality = 'MR' AND BodyPartExamined = 'BREAST' LIMIT 20 """) # Access results as pandas DataFrame for idx, row in results.iterrows(): print(f"Patient: {row['PatientID']}, Series: {row['SeriesInstanceUID']}") ``` **Filter by cancer type, requires joining with `collections_index`:** ```python client.fetch_index("collections_index") results = client.sql_query(""" SELECT i.collection_id, i.PatientID, i.SeriesInstanceUID, i.Modality FROM index i JOIN collections_index c ON i.collection_id = c.collection_id WHERE c.CancerTypes LIKE '%Breast%' AND i.Modality = 'MR' LIMIT 20 """) ``` **Available metadata fields** (full list use `client.indices_overview`): - Identifiers: collection_id, PatientID, StudyInstanceUID, SeriesInstanceUID - Imaging: Modality, BodyPartExamined, Manufacturer, ManufacturerModelName - Clinical: PatientAge, PatientSex, StudyDate - Descriptions: StudyDescription, SeriesDescription - License: license_short_name **Note:** Cancer types are in `collections_index.CancerTypes`, not in the main `index` table. ### 3. Downloading DICOM Files Efficiently download imaging data from IDC cloud storage: **Download entire collection:** ```python from idc_index import IDCClient client = IDCClient() # Download small collection (RIDER Pilot ~1GB) client.download_from_selection( collection_id="rider_pilot", downloadDir="./data/rider" ) ``` **Download specific series:** ```python # First, query for series UIDs series_df = client.sql_query(""" SELECT SeriesInstanceUID FROM index WHERE Modality = 'CT' AND BodyPartExamined = 'CHEST' AND collection_id = 'nlst' LIMIT 5 """) # Download only these series client.download_from_selection( seriesInstanceUID=list(series_df['SeriesInstanceUID'].values), downloadDir="./data/lung_ct" ) ``` **Custom directory structure:** Default `dirTemplate` is: `%collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID` ```python # Simplified hierarchy (omit StudyInstanceUID level) client.download_from_selection( collection_id="tcga_luad", downloadDir="./data", dirTemplate="%collection_id/%PatientID/%Modality" ) # Result path: ./data/tcga_luad/TCGA-05-4244/CT/ # Flat structure (all files in same directory) client.download_from_selection( seriesInstanceUID=list(series_df['SeriesInstanceUID'].values), downloadDir="./data/flat", dirTemplate="" ) # Result path: ./data/flat/*.dcm ``` ### Command-Line Download The `idc download` command provides command-line access to download functionality without writing Python code. Available after installing `idc-index`. **Auto-detect input type:** Manifest file path, or identifier (collection_id, PatientID, StudyInstanceUID, SeriesInstanceUID, crdc_series_uuid). ```bash # Download entire collection idc download rider_pilot --download-dir ./data # Download specific series by UID idc download "1.3.6.1.4.1.9328.50.1.69736" --download-dir ./data # Download multiple items (comma-separated) idc download "tcga_luad,tcga_lusc" --download-dir ./data # Download from manifest file (auto-detect) idc download manifest.txt --download-dir ./data ``` **Options:** | Option | Description | |--------|-------------| | `--download-dir` | Output directory (default: current directory) | | `--dir-template` | Directory hierarchy template (default: `%collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID`) | | `--log-level` | Logging verbosity: debug, info, warning, error, critical | **Manifest files:** Manifest files contain S3 URLs (one per line), which can be: - Exported from IDC Portal after queue selection - Shared by collaborators for reproducible data access - Programmatically generated from query results Format (one S3 URL per line): ``` s3://idc-open-data/cb09464a-c5cc-4428-9339-d7fa87cfe837/* s3://idc-open-data/88f3990d-bdef-49cd-9b2b-4787767240f2/* ``` **Example: Generate manifest from Python query:** ```python from idc_index import IDCClient client = IDCClient() # Query series URLs results = client.sql_query(""" SELECT series_aws_url FROM index WHERE collection_id = 'rider_pilot' AND Modality = 'CT' """) # Save as manifest file with open('ct_manifest.txt', 'w') as f: for url in results['series_aws_url']: f.write(url + '\n') ``` Then download: ```bash idc download ct_manifest.txt --download-dir ./ct_data ``` ### 4. Visualizing IDC Images View DICOM data in the browser without downloading: ```python from idc_index import IDCClient import webbrowser client = IDCClient() # First query to get valid UIDs results = client.sql_query(""" SELECT SeriesInstanceUID, StudyInstanceUID FROM index WHERE collection_id = 'rider_pilot' AND Modality = 'CT' LIMIT 1 """) # View single series viewer_url = client.get_viewer_URL(seriesInstanceUID=results.iloc[0]['SeriesInstanceUID']) webbrowser.open(viewer_url) # View all series in a study (useful for multi-series exams like MRI protocols) viewer_url = client.get_viewer_URL(studyInstanceUID=results.iloc[0]['StudyInstanceUID']) webbrowser.open(viewer_url) ``` This method automatically selects OHIF v3 for radiology or SLIM for slide microscopy. Viewing by study is very useful when a DICOM study contains multiple series (e.g., T1, T2, DWI sequences in a single MRI session). ### 5. Understanding and Checking Licenses Check data licenses before use (critical for commercial applications): ```python from idc_index import IDCClient client = IDCClient() # Check licenses for all collections query = """ SELECT DISTINCT collection_id, license_short_name, COUNT(DISTINCT SeriesInstanceUID) as series_count FROM index GROUP BY collection_id, license_short_name ORDER BY collection_id """ licenses = client.sql_query(query) print(licenses) ``` **License types in IDC:** - **CC BY 4.0** / **CC BY 3.0** (~97% of data) - Allows commercial use with attribution - **CC BY-NC 4.0** / **CC BY-NC 3.0** (~3% of data) - Non-commercial use only - **Custom license** (rare) - Some collections have specific terms (e.g., NLM terms) **Important:** Always check the license before using IDC data in publications or commercial applications. Each DICOM file's metadata tags its specific license. ### Generating Citations for Attribution The `source_DOI` column contains DOIs pointing to publications describing the data generation process. To meet attribution requirements, use `citations_from_selection()` to generate properly formatted citations: ```python from idc_index import IDCClient client = IDCClient() # Get citations for collection (APA format by default) citations = client.citations_from_selection(collection_id="rider_pilot") for citation in citations: print(citation) # Get citations for specific series results = client.sql_query(""" SELECT SeriesInstanceUID FROM index WHERE collection_id = 'tcga_luad' LIMIT 5 """) citations = client.citations_from_selection( seriesInstanceUID=list(results['SeriesInstanceUID'].values) ) # Alternative format: BibTeX (for LaTeX documents) bibtex_citations = client.citations_from_selection( collection_id="tcga_luad", citation_format=IDCClient.CITATION_FORMAT_BIBTEX ) ``` **Parameters:** - `collection_id`: Filter by collection - `patientId`: Filter by patient ID - `studyInstanceUID`: Filter by study UID - `seriesInstanceUID`: Filter by series UID - `citation_format`: Use `IDCClient.CITATION_FORMAT_*` constants: - `CITATION_FORMAT_APA` (default) - APA style - `CITATION_FORMAT_BIBTEX` - BibTeX for LaTeX - `CITATION_FORMAT_JSON` - CSL JSON - `CITATION_FORMAT_TURTLE` - RDF Turtle **Best Practice:** When publishing results obtained using IDC data, include generated citations to properly attribute data sources and meet license requirements. ### 6. Batch Processing and Filtering Efficiently process large-scale datasets through filtering: ```python from idc_index import IDCClient import pandas as pd client = IDCClient() # Find chest CT scans from GE scanners query = """ SELECT SeriesInstanceUID, PatientID, collection_id, ManufacturerModelName FROM index WHERE Modality = 'CT' AND BodyPartExamined = 'CHEST' AND Manufacturer = 'GE MEDICAL SYSTEMS' AND license_short_name = 'CC BY 4.0' LIMIT 100 """ results = client.sql_query(query) # Save manifest for later use results.to_csv('lung_ct_manifest.csv', index=False) # Download in batches to avoid timeouts batch_size = 10 for i in range(0, len(results), batch_size): batch = results.iloc[i:i+batch_size] client.download_from_selection( seriesInstanceUID=list(batch['SeriesInstanceUID'].values), downloadDir=f"./data/batch_{i//batch_size}" ) ``` ### 7. Advanced Queries with BigQuery For queries requiring complete DICOM metadata, complex JOINs, clinical data tables, or private DICOM elements, use Google BigQuery. Requires a GCP project with billing enabled. **Quick Reference:** - Dataset: `bigquery-public-data.idc_current.*` - Main table: `dicom_all` (merged metadata) - Complete metadata: `dicom_metadata` (all DICOM tags) - Private elements: `OtherElements` column (vendor-specific tags like diffusion b-values) For setup, table schemas, query patterns, private element access, and cost optimization, see `references/bigquery_guide.md`. ### 8. Tool Selection Guide | Task | Tool | Reference | |------|------|-----------| | Programmatic query and download | `idc-index` | This document | | Interactive exploration | IDC Portal | https://portal.imaging.datacommons.cancer.gov/ | | Complex metadata queries | BigQuery | `references/bigquery_guide.md` | | 3D visualization and analysis | SlicerIDCBrowser | https://github.com/ImagingDataCommons/SlicerIDCBrowser | **Default Choice:** `idc-index` is recommended for the vast majority of tasks (no authentication, simple API, supports batch downloads). ### 9. Integration with Analysis Pipelines Integrate IDC data into imaging analysis workflows: **Read downloaded DICOM files:** ```python import pydicom import os # Read DICOM files from downloaded series series_dir = "./data/rider/rider_pilot/RIDER-1007893286/CT_1.3.6.1..." dicom_files = [os.path.join(series_dir, f) for f in os.listdir(series_dir) if f.endswith('.dcm')] # Load first image ds = pydicom.dcmread(dicom_files[0]) print(f"Patient ID: {ds.PatientID}") print(f"Modality: {ds.Modality}") print(f"Image shape: {ds.pixel_array.shape}") ``` **Build 3D volume from CT series:** ```python import pydicom import numpy as np from pathlib import Path def load_ct_series(series_path): """Load CT series as 3D numpy array""" files = sorted(Path(series_path).glob('*.dcm')) slices = [pydicom.dcmread(str(f)) for f in files] # Sort by slice position slices.sort(key=lambda x: float(x.ImagePositionPatient[2])) # Stack into 3D array volume = np.stack([s.pixel_array for s in slices]) return volume, slices[0] # Return volume and first slice for metadata volume, metadata = load_ct_series("./data/lung_ct/series_dir") print(f"Volume shape: {volume.shape}") # (z, y, x) ``` **Integration with SimpleITK:** ```python import SimpleITK as sitk from pathlib import Path # Read DICOM series series_path = "./data/ct_series" reader = sitk.ImageSeriesReader() dicom_names = reader.GetGDCMSeriesFileNames(series_path) reader.SetFileNames(dicom_names) image = reader.Execute() # Apply processing (smoothing filter) smoothed = sitk.CurvatureFlow(image1=image, timeStep=0.125, numberOfIterations=5) # Save as NIfTI format sitk.WriteImage(smoothed, "processed_volume.nii.gz") ``` ## Common Use Cases ### Use Case 1: Find and Download Lung CT Scans for Deep Learning **Goal:** Build training dataset of lung CT scans from NLST collection **Steps:** ```python from idc_index import IDCClient client = IDCClient() # 1. Query lung CT scans with specific criteria query = """ SELECT PatientID, SeriesInstanceUID, SeriesDescription FROM index WHERE collection_id = 'nlst' AND Modality = 'CT' AND BodyPartExamined = 'CHEST' AND license_short_name = 'CC BY 4.0' ORDER BY PatientID LIMIT 100 """ results = client.sql_query(query) print(f"Found {len(results)} series from {results['PatientID'].nunique()} patients") # 2. Download data organized by patient client.download_from_selection( seriesInstanceUID=list(results['SeriesInstanceUID'].values), downloadDir="./training_data", dirTemplate="%collection_id/%PatientID/%SeriesInstanceUID" ) # 3. Save manifest for reproducibility results.to_csv('training_manifest.csv', index=False) ``` ### Use Case 2: Query Brain MRI by Manufacturer for Quality Research **Goal:** Compare image quality across different MRI scanner manufacturers **Steps:** ```python from idc_index import IDCClient import pandas as pd client = IDCClient() # Query brain MRI grouped by manufacturer query = """ SELECT Manufacturer, ManufacturerModelName, COUNT(DISTINCT SeriesInstanceUID) as num_series, COUNT(DISTINCT PatientID) as num_patients FROM index WHERE Modality = 'MR' AND BodyPartExamined LIKE '%BRAIN%' GROUP BY Manufacturer, ManufacturerModelName HAVING num_series >= 10 ORDER BY num_series DESC """ manufacturers = client.sql_query(query) print(manufacturers) # Download samples from each manufacturer for comparison for _, row in manufacturers.head(3).iterrows(): mfr = row['Manufacturer'] model = row['ManufacturerModelName'] query = f""" SELECT SeriesInstanceUID FROM index WHERE Manufacturer = '{mfr}' AND ManufacturerModelName = '{model}' AND Modality = 'MR' AND BodyPartExamined LIKE '%BRAIN%' LIMIT 5 """ series = client.sql_query(query) client.download_from_selection( seriesInstanceUID=list(series['SeriesInstanceUID'].values), downloadDir=f"./quality_study/{mfr.replace(' ', '_')}" ) ``` ### Use Case 3: Preview Series Without Downloading **Goal:** Preview imaging data before deciding to download ```python from idc_index import IDCClient import webbrowser client = IDCClient() series_list = client.sql_query(""" SELECT SeriesInstanceUID, PatientID, SeriesDescription FROM index WHERE collection_id = 'acrin_nsclc_fdg_pet' AND Modality = 'PT' LIMIT 10 """) # Preview each series in browser for _, row in series_list.iterrows(): viewer_url = client.get_viewer_URL(seriesInstanceUID=row['SeriesInstanceUID']) print(f"Patient {row['PatientID']}: {row['SeriesDescription']}") print(f" View at: {viewer_url}") # webbrowser.open(viewer_url) # Uncomment to auto-open ``` For more visualization options, see [IDC Portal Getting Started Guide](https://learn.canceridc.dev/portal/getting-started) or [SlicerIDCBrowser](https://github.com/ImagingDataCommons/SlicerIDCBrowser) for 3D Slicer integration. ### Use Case 4: License-Sensitive Batch Download for Commercial Use **Goal:** Only download CC-BY licensed data suitable for commercial use **Steps:** ```python from idc_index import IDCClient client = IDCClient() # Query only CC BY licensed data (allows commercial use with attribution) query = """ SELECT SeriesInstanceUID, collection_id, PatientID, Modality FROM index WHERE license_short_name LIKE 'CC BY%' AND license_short_name NOT LIKE '%NC%' AND Modality IN ('CT', 'MR') AND BodyPartExamined IN ('CHEST', 'BRAIN', 'ABDOMEN') LIMIT 200 """ cc_by_data = client.sql_query(query) print(f"Found {len(cc_by_data)} CC BY licensed series") print(f"Collections: {cc_by_data['collection_id'].unique()}") # Download after license verification client.download_from_selection( seriesInstanceUID=list(cc_by_data['SeriesInstanceUID'].values), downloadDir="./commercial_dataset", dirTemplate="%collection_id/%Modality/%PatientID/%SeriesInstanceUID" ) # Save license information cc_by_data.to_csv('commercial_dataset_manifest_CC-BY_ONLY.csv', index=False) ``` ## Best Practices - **Check license before use** - Always query `license_short_name` field and comply with license terms (CC BY vs CC BY-NC). - **Generate citations for attribution** - Use `citations_from_selection()` to get properly formatted citations from `source_DOI` values and include them in publications. - **Start with small queries** - Use `LIMIT` clause when exploring to avoid long downloads and help understand data structure. - **Use mini-index for simple queries** - Only use BigQuery when you need comprehensive metadata or complex JOINs. - **Organize downloads with dirTemplate** - Use meaningful directory structures like `%collection_id/%PatientID/%Modality`. - **Cache query results** - Save DataFrames as CSV files to avoid repeated queries and ensure reproducibility. - **Estimate size before downloading** - Check collection sizes before downloading—some collections are TB in size! - **Save manifests** - Always save query results with series UIDs for reproducibility and data provenance tracking. - **Read documentation** - IDC data structure and metadata fields are documented at https://learn.canceridc.dev/. - **Use IDC forum** - Search for answers at https://discourse.canceridc.dev/, or ask questions to IDC maintainers and users. ## Troubleshooting **Problem: `ModuleNotFoundError: No module named 'idc_index'`** - **Cause:** idc-index package not installed. - **Solution:** Install with `pip install --upgrade idc-index`. **Problem: Download fails with connection timeout** - **Cause:** Unstable network or too large download. - **Solution:** - Download in small batches (e.g., 10-20 series at a time). - Check network connection. - Use `dirTemplate` to organize downloads by batch. - Implement retry logic with delays. **Problem: `BigQuery quota exceeded` or billing error** - **Cause:** BigQuery requires a GCP project with billing enabled. - **Solution:** For simple queries, use idc-index mini-index (no cost), or see `references/bigquery_guide.md` for cost optimization tips. **Problem: Cannot find series UID or no data returned** - **Cause:** UID typo, data not in current IDC version, or wrong field name. - **Solution:** - Check if data is in current IDC version (some older data may be deprecated). - Test query with `LIMIT 5` first. - Verify field names against metadata schema documentation. **Problem: Downloaded DICOM files won't open** - **Cause:** Download corrupted or viewer incompatible. - **Solution:** - Check DICOM object type (Modality and SOPClassUID attributes) — some object types require specialized tools. - Verify file integrity (check file sizes). - Validate with pydicom: `pydicom.dcmread(file, force=True)`. - Try different DICOM viewers (3D Slicer, Horos, RadiAnt, QuPath). - Re-download the series. ## Common SQL Query Patterns Quick reference for common queries. Detailed contextual examples are in the "Core Capabilities" section above. ### Discover Available Filter Values ```python # What modalities exist? client.sql_query("SELECT DISTINCT Modality FROM index") # What body parts for a specific modality? client.sql_query(""" SELECT DISTINCT BodyPartExamined, COUNT(*) as n FROM index WHERE Modality = 'CT' AND BodyPartExamined IS NOT NULL GROUP BY BodyPartExamined ORDER BY n DESC """) # What manufacturers for MR? client.sql_query(""" SELECT DISTINCT Manufacturer, COUNT(*) as n FROM index WHERE Modality = 'MR' GROUP BY Manufacturer ORDER BY n DESC """) ``` ### Find Annotations and Segmentations **Note:** Not all image-derived objects belong to analysis result collections. Some annotations are stored with the original images. Use DICOM Modality or SOPClassUID to find all derived objects regardless of their collection type. ```python # Find all segmentations and structure sets via DICOM Modality # SEG = DICOM Segmentation, RTSTRUCT = Radiotherapy structure set client.sql_query(""" SELECT collection_id, Modality, COUNT(*) as series_count FROM index WHERE Modality IN ('SEG', 'RTSTRUCT') GROUP BY collection_id, Modality ORDER BY series_count DESC """) # Find segmentations for specific collection (including non-analysis result items) client.sql_query(""" SELECT SeriesInstanceUID, SeriesDescription, analysis_result_id FROM index WHERE collection_id = 'tcga_luad' AND Modality = 'SEG' """) # List analysis result collections (curated derived datasets) client.fetch_index("analysis_results_index") client.sql_query(""" SELECT analysis_result_id, analysis_result_title, Collections, Modalities FROM analysis_results_index """) # Find analysis results for a specific source collection client.sql_query(""" SELECT analysis_result_id, analysis_result_title FROM analysis_results_index WHERE Collections LIKE '%tcga_luad%' """) # Use seg_index for detailed DICOM segmentation metadata client.fetch_index("seg_index") # Get segmentation statistics by algorithm client.sql_query(""" SELECT AlgorithmName, AlgorithmType, COUNT(*) as seg_count FROM seg_index WHERE AlgorithmName IS NOT NULL GROUP BY AlgorithmName, AlgorithmType ORDER BY seg_count DESC LIMIT 10 """) # Find segmentations for specific source images (e.g., chest CT) client.sql_query(""" SELECT s.SeriesInstanceUID as seg_series, s.AlgorithmName, s.total_segments, s.segmented_SeriesInstanceUID as source_series FROM seg_index s JOIN index src ON s.segmented_SeriesInstanceUID = src.SeriesInstanceUID WHERE src.Modality = 'CT' AND src.BodyPartExamined = 'CHEST' LIMIT 10 """) # Find TotalSegmentator results in source image context client.sql_query(""" SELECT seg_info.collection_id, COUNT(DISTINCT s.SeriesInstanceUID) as seg_count, SUM(s.total_segments) as total_segments FROM seg_index s JOIN index seg_info ON s.SeriesInstanceUID = seg_info.SeriesInstanceUID WHERE s.AlgorithmName LIKE '%TotalSegmentator%' GROUP BY seg_info.collection_id ORDER BY seg_count DESC """) ``` ### Query Slide Microscopy Data ```python # sm_index contains detailed metadata; join with index to get collection_id client.fetch_index("sm_index") client.sql_query(""" SELECT i.collection_id, COUNT(*) as slides, MIN(s.min_PixelSpacing_2sf) as min_resolution FROM sm_index s JOIN index i ON s.SeriesInstanceUID = i.SeriesInstanceUID GROUP BY i.collection_id ORDER BY slides DESC """) ``` ### Estimate Download Size ```python # Total size for specific criteria client.sql_query(""" SELECT SUM(series_size_MB) as total_mb, COUNT(*) as series_count FROM index WHERE collection_id = 'nlst' AND Modality = 'CT' """) ``` ### Link to Clinical Data ```python client.fetch_index("clinical_index") # Find collections with clinical data and their corresponding tables client.sql_query(""" SELECT collection_id, table_name, COUNT(DISTINCT column_label) as columns FROM clinical_index GROUP BY collection_id, table_name ORDER BY collection_id """) ``` For complete schemas including value mappings and patient cohort selection, see `references/clinical_data_guide.md`. ## Related Skills The following skills complement the IDC workflow for downstream analysis and visualization: ### DICOM Processing - **pydicom** - Read, write, and manipulate downloaded DICOM files. Used to extract pixel data, read metadata, anonymize, and convert formats. Fundamental for processing IDC radiology data (CT, MR, PET). ### Pathology and Slide Microscopy - **histolab** - Lightweight slide extraction and preprocessing for Whole Slide Images (WSI). Used for basic slide processing, tissue detection, and dataset preparation of IDC slide microscopy data. - **pathml** - Full-featured computational pathology toolbox. Used for advanced WSI analysis, including multiplexed imaging, nucleus segmentation, and training machine learning models on pathology data downloaded from IDC. ### Metadata Visualization - **matplotlib** - Low-level plotting library with full customization. Used to create static charts summarizing IDC query results (modality bar charts, series count histograms, etc.). - **seaborn** - Statistical visualization library with pandas integration. Used for quick exploration of IDC metadata distributions, variable relationships, and categorical comparisons with beautiful default styles. - **plotly** - Interactive visualization library. Used when you need hover info, zoom and pan capabilities to explore IDC metadata, or to create embeddable web dashboards for collection statistics. ### Data Exploration - **exploratory-data-analysis** - Comprehensive EDA for scientific data files. Used after downloading IDC data to understand file structure, quality, and characteristics before analysis. ## Resources ### Schema Reference (Primary Source) **Always use `client.indices_overview` to get current column schemas.** This ensures consistency with your installed idc-index version: ```python # Get column names and types for any table schema = client.indices_overview["index"]["schema"] columns = [(c['name'], c['type'], c.get('description', '')) for c in schema['columns']] ``` ### Reference Documentation - **clinical_data_guide.md** - Clinical/tabular data navigation, value mappings, and linking with imaging data. - **cloud_storage_guide.md** - Direct access to cloud storage buckets (S3/GCS), file organization, CRDC UUID, versioning, and reproducibility. - **cli_guide.md** - Complete idc-index command-line interface reference (`idc download`, `idc download-from-manifest`, `idc download-from-selection`). - **bigquery_guide.md** - Advanced BigQuery usage guide for complex metadata queries. - **dicomweb_guide.md** - DICOMweb endpoint URLs, code examples, and Google Healthcare API implementation details. - **[indices_reference](https://idc-index.readthedocs.io/en/latest/indices_reference.html)** - External documentation for index tables (may be ahead of installed version). ### External Links - **IDC Portal**: https://portal.imaging.datacommons.cancer.gov/explore/ - **Official Documentation**: https://learn.canceridc.dev/ - **Tutorials**: https://github.com/ImagingDataCommons/IDC-Tutorials - **User Forum**: https://discourse.canceridc.dev/ - **idc-index GitHub**: https://github.com/ImagingDataCommons/idc-index - **Citation**: Fedorov, A., et al. "National Cancer Institute Imaging Data Commons: Toward Transparency, Reproducibility, and Scalability in Imaging Artificial Intelligence." RadioGraphics 43.12 (2023). https://doi.org/10.1148/rg.230180 ### Skill Updates This skill version can be found in the skill metadata. To check for updates: - Visit [Releases page](https://github.com/ImagingDataCommons/idc-claude-skill/releases) - Watch the repository on GitHub (Watch → Custom → Releases)